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Abstract 

Many probabilistic models introduce strong 
dependencies between variables using a latent 
multivariate Gaussian distribution or a Gaus- 
sian process. We present a new Markov chain 
Monte Carlo algorithm for performing infer- 
ence in models with multivariate Gaussian 
priors. Its key properties are: 1) it has simple, 
generic code applicable to many models, 2) it 
has no free parameters, 3) it works well for 
a variety of Gaussian process based models. 
These properties make our method ideal for 
use while model building, removing the need 
to spend time deriving and tuning updates 
for more complex algorithms. 

1 Introduction 

The multivariate Gaussian distribution is commonly 
used to specify a priori beliefs about dependencies 
between latent variables in probabilistic models. The 
parameters of such a Gaussian may be specified directly, 
as in graphical models and Markov random fields, or 
implicitly as the marginals of a Gaussian process (GP). 
Gaussian processes may be used to express concepts of 
spatial or temporal coherence, or may more generally 
be used to construct Bayesian kernel methods for non- 
parametric regression and classification. Rasmusscn 
and Williams (2006) provide a recent review of GPs. 

Inferences can only be calculated in closed form for 
the simplest Gaussian latent variable models. Recent 
work shows that posterior marginals can sometimes be 
well approximated with deterministic methods (Kuss 
and Rasmussen, 2005; Rue et al., 2009). Markov chain 
Monte Carlo (MCMC) methods represent joint pos- 
terior distributions with samples (e.g. Neal, 1993). 
MCMC can be slower but applies more generally. 
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In some circumstances MCMC provides good results 
with minimal model-specific implementation. Gibbs 
sampling, in particular, is frequently used to sample 
from probabilistic models in a straightforward way, up- 
dating one variable at a time. In models with strong 
dependencies among variables, including many with 
Gaussian priors, Gibbs sampling is known to perform 
poorly. Several authors have previously addressed the 
issue of sampling from models containing strongly cor- 
related Gaussians, notably the recent work of Titsias 
et al. (2009). In this paper we provide a technique 
called elliptical slice sampling that is simpler and of- 
ten faster than other methods, while also removing the 
need for preliminary tuning runs. Our method provides 
a drop-in replacement for MCMC samplers of Gaussian 
models that are currently using Gibbs or Metropolis- 
Hastings and we demonstrate empirical success against 
competing methods with several different GP-based 
likelihood models. 

2 Elliptical slice sampling 

Our objective is to sample from a posterior distri- 
bution over latent variables that is proportional to 
the product of a multivariate Gaussian prior and a 
likelihood function that ties the latent variables to the 
observed data. We will use f to indicate the vector 
of latent variables that we wish to sample and denote 
a zero-mean Gaussian distribution with covariance S by 

7V(f;0,S) = |27rS|-V 2 exp( -\ t T YrH). (1) 

We also use f <~ -A/(0, S) to state that f is drawn from 
a distribution with the density in (1). Gaussians 
with non-zero means can simply be shifted to have 
zero-mean with a change of variables; an example 
will be given in Section 3.3. We use L(f) = p(data|f) 
to denote the likelihood function so that our target 
distribution for the MCMC sampler is 

p*(f) = i^(f;0,S)L(f), (2) 

where Z is the normalization constant, or the marginal 
likelihood, of the model. 

Our starting point is a Metropolis-Hastings method 
introduced by Neal (1999). Given an initial state f, a 
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new state 

f = y/l-e 2 f + ev, v ~ 7V(0, £) (3) 

is proposed, where e e [—1,1] is a step-size parameter. 
The proposal is a sample from the prior for e = 1 and 
more conservative for values closer to zero. The move 
is accepted with probability 

p(accept) = min(l, L(f')/L(f)), (4) 

otherwise the next state in the chain is a copy of f . 

Neal reported that for some Gaussian process classifiers 
the Metropolis-Hastings method was many times faster 
than Gibbs sampling. The method is also simpler to 
implement and can immediately be applied to a much 
wider variety of models with Gaussian priors. 

A drawback, identified by Neal (1999), is that the 
step-size e needs to be chosen appropriately for the 
Markov chain to mix efficiently. This may require 
preliminary runs. Usually parameters of the covariance 
E and likelihood function L are also inferred from data. 
Different step-size parameters may be needed as the 
model parameters are updated. It would be desirable 
to automatically search over the step-size parameter, 
while maintaining a valid algorithm. 

For a fixed auxiliary random draw, v> , the locus of 
possible proposals by varying e G [—1,1] in (3) is half 
of an ellipse. 

f = v sin6> + f cos0, (5) 

defining a full ellipse passing through the current state f 
and the auxiliary draw v. For a fixed 9 there is an 
equivalent e that gives the same proposal distribution 
in the original algorithm. However, if we can search 
over the step-size, the full ellipse gives a richer choice 
of updates for a given v. 

2.1 Sampling an alternative model 

'Slice sampling' (Neal, 2003) provides a way to sample 
along a line with an adaptive step-size. Proposals are 
drawn from an interval or 'bracket' which, if too large, 
is shrunk automatically until an acceptable point is 
found. There arc also ways to automatically enlarge 
small initial brackets. Naively applying these adaptive 
algorithms to select the value of e in (3) or 9 in (5) 
does not result in a Markov chain transition operator 
with the correct stationary distribution. The locus of 
states is defined using the current position f , which 
upsets the reversibility and correctness of the update. 

We would like to construct a valid Markov chain tran- 
sition operator on the ellipse of states that uses slice 
sampling's existing ability to adaptively pick step sizes. 



Input: current state f, a routine that samples from 
A/"(0, E), log-likelihood function logL. 

Output: a new state f. When f is drawn from p*(f)oc 
A/"(f; 0, S) L(f), the marginal distribution of f' is also p*. 

1. Sample from p(uo,u\,9\ (i/o sin0 + i/i cos# = f) ): 

9 ~ Uniform[0, 2tt] 

v ~.A/"(0,E) 
va 4- f sin 9 + v cos 9 
i/i 4— f cos 9 — 1/ sin 9 

2. Update 9 € [0, 2-k] using slice sampling (Neal, 2003) 
on: 

p*(9 1 i/o, vi) oc L(y o sin# + v\ cosf?) 

3. return f ' = vq sin 9 + v\ cos 9 

Figure 1: Intuition behind elliptical slice sampling. This 
is a valid algorithm, but will be adapted (Figure 2). 

We will first intuitively construct a valid method by 
positing an augmented probabilistic model in which 
the step-size is a variable. Standard slice sampling algo- 
rithms then apply to that model. We will then adjust 
the algorithm for our particular setting to provide a 
second, slightly tidier algorithm. 

Our augmented probabilistic model replaces the origi- 
nal latent variable with prior f ~ A/"(0, E) with 

i/o ~ JV(0, £) 
^(0,E) 

9 ~ Uniform[0, 2tt] V ' 

f = i/o sin 6 + i/i cos 9. 

The marginal distribution over the original latent vari- 
able f is still Af(0, E), so the distribution over data 
is identical. However, we can now sample from the 
posterior over the new latent variables: 

p> , u u 6) oc W>o; 0, £) JV(i/i ; 0, E) L(f (i/„, u u 0)), 

and use the values of f deterministically derived from 
these samples. Our first approach applies two Monte 
Carlo transition operators that leave the new latent 
posterior invariant. 

Operator 1: jointly resample the latents v$,Vi,9 
given the constraint that f (i/q, isi, 9) is unchanged. Be- 
cause the effective variable of interest doesn't change, 
the likelihood does not affect this conditional distribu- 
tion, so the update is generic and easy to implement. 

Operator 2: use a standard slice sampling algorithm 
to update the step-size 9 given the other variables. 

The resulting algorithm is given in Figure 1. The 
auxiliary model construction makes the link to slice 
sampling explicit, which makes it easy to understand 
the validity of the approach. However, the algorithm 
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can be neater and the [0, 2tt] range for slice sampling 
is unnatural on an ellipse. The algorithm that we will 
present in detail results from eliminating Vq and V\ 
and a different way of setting slice sampling's initial 
proposal range. The precise connection will be given 
in Section 2.4. A more direct, technical proof that 
the equilibrium distribution of the Markov chain is the 
target distribution is presented in Section 2.3. 

Elliptical slice sampling, our proposed algorithm is 
given in Figure 2, which includes the details of the slice 
sampler. An example run is illustrated in Figure 3(a-d). 
Even for high-dimensional problems, the states consid- 
ered within one update lie in a two-dimensional plane. 
In high dimensions f and u are likely to have similar 
lengths and be an angle of 7r/2 apart. Therefore the 
ellipse will typically be fairly close to a circle, although 
this is not required for the validity of the algorithm. 

As intended, our slice sampling approach selects a 
new location on the randomly generated ellipse in (5). 
There are no rejections: the new state f is never equal 
to the current state f unless that is the only state on 
the ellipse with non-zero likelihood. The algorithm 
proposes the angle 9 from a bracket [0 m in, 9 ma _ x ] which 
is shrunk exponentially quickly until an acceptable 
state is found. Thus the step size is effectively adapted 
on each iteration for the current v and S. 

2.2 Computational cost 

Drawing u costs 0(N 3 ), for A-dimensional f and gen- 
eral S. The usual implementation of a Gaussian sam- 
pler would involve caching a (Cholesky) decomposition 
of S, such that draws on subsequent iterations cost 
0(N 2 ). For some problems with special structure draw- 
ing samples from the Gaussian prior can be cheaper. 

In many models the Gaussian prior distribution cap- 
tures dependencies: the observations are independent 
conditioned on f. In these cases, computing L(f) will 
cost O(N) computation. As a result, drawing the v 
random variate will be the dominant cost of the update 
in many high-dimensional problems. In these cases 
the extra cost of elliptical slice sampling over Neal's 
Metropolis-Hastings algorithm will be small. 

As a minor performance improvement, our implementa- 
tion optionally accepts the log-likelihood of the initial 
state, if known from a previous update, so that it 
doesn't need to be recomputed in step 2. 

2.3 Validity 

Elliptical slice sampling considers settings of an angle 
variable, 9. Figure 2 presented the algorithm as it 
would be used: there is no need to index or remember 
the visited angles. For the purposes of analysis we 



Input: current state f, a routine that samples from 
A/"(0,E), log-likelihood function logL. 



Output: a new state f'. When f is drawn from p*(f)oc 
A/"(f; 0, £) L(f), the marginal distribution of f' is also p*. 



1. 


Gnoose ellipse: u ~ AI (U, zjJ 


2. 


X 1*1 1 ■ 1 111 1 11 

Log-likehhood threshold: 




u ~ Uniform[0, 1] 




log y <— log L(f ) + log u 


3. 


Draw an initial proposal, also defining a bracket: 




~ Uniform [0, 2vr] 




[e min ,6» max ] «- [0-27T, 61] 


4. 


f ' <— f cos 6 + vsinB 


5. 


if logL(f') > logy then: 


6. 


Accept: return f' 


7. 


else: 




Shrink the bracket and try a new point: 


8. 


if 9 < then: 6 m i n <—0 else: 6< max <— 6 


9. 


6 ~ Uniform [0 m in, 


10. 


GoTo 4. 



Figure 2: The elliptical slice sampling algorithm. 




Figure 3: (a) The algorithm receives f=X as input. Step 1 
draws auxiliary variate i/=Hr", defining an ellipse centred at 
the origin (o). Step 2: a likelihood threshold defines the 
'slice' ). Step 3: an initial proposal • is drawn, in this 
case not on the slice, (b) The first proposal defined both 
edges of the [Omin, S max ] bracket; the second proposal (•) 
is also drawn from the whole range, (c) One edge of the 
bracket ( — ) is moved to the last rejected point such that 
X is still included. Proposals are made with this shrinking 
rule until one lands on the slice, (d) The proposal here (•) 
is on the slice and is returned as f'. (e) Shows the reverse 
configuration discussed in Section 2.3: X is the input f', 
which with auxiliary i/=+ defines the same ellipse. The 
brackets and first three proposals (•) are the same. The 
final proposal (•) is accepted, a move back to f . 
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will denote the ordered sequence of angles considered 
during the algorithm by {Ok} with k = l..K. 

We first identify the joint distribution over a state 
drawn from the target distribution (2) and the other 
random quantities generated by the algorithm: 

p(f , y, v, {0 k }) = /(f) p(y | f ) p(u) P ({6 k } | f , u, y) 

= ijV(f;0,E)W(i/;0,S)p({e fc }|f,i/,y), (7) 

where the vertical level y was drawn uniformly 
in [0, L(f)], that is, p(y | f) = 1/L(f). The final term, 
p({0k} | f j y), is a distribution over a random-sized set 
of angles, defined by the stopping rule of the algorithm. 

Given the random variables in (7) the algorithm de- 
terministically computes positions, {ffc}, accepting the 
first one that satisfies a likelihood constraint. More 
generally each angle specifies a rotation of the two 
a priori Gaussian variables: 



v k = v cos 6 k — f sin k 
ft = v sin 9k + f cos 9k, k = 1..K. 



(8) 



For any choice of 9k this deterministic transformation 
has unit Jacobian. Any such rotation also leaves the 
joint prior probability invariant, 

7V(i/ fc ;0,E)7V(f fc ;0,E) = JV(i/;0,E)JV(f;0,E) (9) 

for all k, which can easily be verified by substituting 
values into the Gaussian form (f ). 

It is often useful to consider how an MCMC algorithm 
could make a reverse transition from the final state f 
back to the initial state f. The final state f' = fjf was 
the result of a rotation by 9k in (8). Given an initial 
state of f = fx, the algorithm could generate v' = v K 
in step 1. Then a rotation of —6k would return back 
to the original (f , v ) pair. Moreover, the same ellipse 
of states is accessible and rotations of 9k — 9k will 
reproduce any intermediate fk<K locations visited by 
the initial run of the algorithm. 



In fact, the algorithm is reversible: 

p{f,y,u,{e k })=p{i', y y,{e' k }), 



(10) 



the equilibrium probability of a forwards draw (7) is 
the same as the probability of starting at f, drawing 
the same y (possible because L(f')>y), i/' = i/ K and 



angles, 9' k 



Ok -0 K k<K 
-9 K k = K, 



(11) 



resulting in the original state f being returned. The 
reverse configuration corresponding to the result of a 
forwards run in Figure 3(d) is illustrated in Figure 3(e). 



Substituting (9) into (7) shows that ensuring that the 
forward and reverse angles are equally probable, 

p({e k }\f,u,y)=p({e , k }\f',u',y), (12) 

results in the reversible property (10). 

The algorithm does satisfy (12): The probability of the 
first angle is always i/W. If more angles were considered 
before an acceptable state was found, these angles were 
drawn with probabilities l/(6 max — 9 m - m ). Whenever 
the bracket was shrunk in step 8, the side to shrink must 
have been chosen such that ¥k remained selectable as it 
was selected later. The reverse transition uses the same 
intermediate proposals, making the same rejections 
with the same likelihood threshold, y. Because the 
algorithm explicitly includes the initial state, which in 
reverse is {k at 9' = 0, the reverse transition involves 
the same set of shrinking decisions as the forwards 
transitions. As the same brackets are sampled, the 
V(#max — #min) probabilities for drawing angles are the 
same for the forwards and reverse transitions. 

The reversibility of the transition operator (10) implies 
that the target posterior distribution (2) is a station- 
ary distribution of the Markov chain. Drawing f from 
the stationary distribution and running the algorithm 
draws a sample from the joint auxiliary distribution (7). 
The deterministic transformations in (8) and (11) have 
unit Jacobian, so the probability density of obtaining 
a joint draw corresponding to (f, y, v' , {0' k }) is equal 
to the probability given by (7) for the original vari- 
ables. The reversible property in (10) shows that this 
is the same probability as generating the variables by 
first generating f from the target distribution and gen- 
erating the remaining quantities using the algorithm. 
Therefore, the marginal probability of f is given by 
the target posterior (2). 

Given the first angle, the distribution over the first pro- 
posed move is Af(f cos 9, E sin 2 0). Therefore, there is a 
non-zero probability of transitioning to any region that 
has non-zero probability under the posterior. This is 
enough to ensure that, formally, the chain is irreducible 
and aperiodic (Tierney, 1994). Therefore, the Markov 
chain has a unique stationary distribution and repeated 
applications of elliptical slice sampling to an arbitrary 
starting point will asymptotically lead to points drawn 
from the target posterior distribution (2). 

2.4 Slice sampling variants 

There is some amount of choice in how the slice sampler 
on the ellipse could be set up. Other methods for 
proposing angles could have been used, as long as they 
satisfied the reversible condition in (12). The particular 
algorithm proposed in Figure 2 is appealing because it 
is simple and has no free parameters. 
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The algorithm must choose the initial edges of the 
bracket [# m i n , ^max] randomly. It would be aesthet- 
ically pleasing to place the edges of the bracket at 
the opposite side of the ellipse to the current position, 
at ±7r. However this deterministic bracket placement 
would not be reversible and gives an invalid algorithm. 

The edge of a randomly-chosen bracket could lie on 
the 'slice', the acceptable region of states. Our recom- 
mended elliptical slice sampling algorithm, Figure 2, 
would accept this point. The initially- presented algo- 
rithm, Figure 1, effectively randomly places the end- 
points of the bracket but without checking this location 
for acceptability. Apart from this small change, it can 
be shown that the algorithms are equivalent. 

In typical problems the slice will not cover the whole 
ellipse. For example, if f is a representative sample 
from a posterior, often — f will not be. Increasing the 
probability of proposing points close to the current 
state may increase efficiency One way to do this would 
be to shrink the bracket more aggressively (Skilling 
and MacKay, 2003). Another would be to derive a 
model from the auxiliary variable model (6), but with 
a non-uniform distribution on 9. Another way would 
be to randomly position an initial bracket of width less 
than 2tt — the code that we provide optionally allows 
this. However, as explained in section 2.2, for high- 
dimensional problems such tuning will often only give 
small improvements. For smaller problems we have 
seen it possible to improve the cpu-time efficiency of 
the algorithm by around two times. 

Another possible line of research is methods for biasing 
proposals away from the current state. For example 
the 'over-relaxed' methods discussed by Neal (2003) 
have a bias towards the opposite side of the slice from 
the current position. In our context it may be desirable 
to encourage moves close to 9 = tt/2, as these moves are 
independent of the previous position. These proposals 
are only likely to be useful when the likelihood terms are 
very weak, however. In the limit of sampling from the 
prior due to a constant likelihood, the algorithm already 
samples reasonably efficiently. To see this, consider 
the distribution over the outcome after N iterations 
initialized at f°: 

N N N 

f N = f° Y[ cos (9™ + ^ m sin0 m Y[ cos9 n , 

n—l m—1 n—ra+1 

where v n and 9 n are values drawn at iteration n. Only 
one angle is drawn per iteration when sampling from 
the prior, because the first proposal is always accepted. 
The only dependence on the initial state is the first 
term, the coefficient of which shrinks towards zero 
exponentially quickly. 



2.5 Limitations 

A common modeling situation is that an unknown 
constant offset, c ~ Af(Q, & m ), has been added to the 
entire latent vector f. The resulting variable, g = f+c, is 
still Gaussian distributed, with the constant a\ added 
to every element of the covariance matrix. Neal (1999) 
identified that this sort of covariance will not tend 
to produce useful auxiliary draws u. An iteration of 
the Markov chain can only add a nearly-constant shift 
to the current state. Indeed, covariances with large 
constant terms are generally problematic as they tend 
to be poorly conditioned. Instead, large offsets should 
be modeled and sampled as separate variables. 

No algorithm can sample effectively from arbitrary dis- 
tributions. As any distribution can be factored as in 
(2), there exist likelihoods L(f) for which elliptical slice 
sampling is not effective. Many Gaussian process appli- 
cations have strong prior smoothness constraints and 
relatively weak likelihood constraints. This important 
regime is where we focus our empirical comparison. 

3 Related work 

Elliptical slice sampling builds on a Metropolis- 
Hastings (M-H) update proposed by Neal (1999). Neal 
reported that the original update performed moderately 
better than using a more obvious M-H proposal, 

f' = f + ei/, i/~./V(0,E), (13) 

and much better than Gibbs sampling for Gaussian 
process classification. Neal also proposed using Hy- 
brid/Hamiltonian Monte Carlo (Duane et al., 1987; 
Neal, 1993), which can be very effective, but requires 
tuning and the implementation of gradients. We now 
consider some other alternatives that have similar re- 
quirements to elliptical slice sampling. 

3.1 'Conventional' slice sampling 

Elliptical slice sampling builds on the family of methods 
introduced by Neal (2003). Several of the existing 
slice sampling methods would also be easy to apply: 
they only require point-wise evaluation of the posterior 
up to a constant. These methods do have step-size 
parameters, but unlike simple Metropolis methods, 
typically the performance of slice samplers does not 
crucially rely on carefully setting free parameters. 

The most popular generic slice samplers use simple 
univariate updates, although applying these directly 
to f would suffer the same slow convergence problems 
as Gibbs sampling. While Agarwal and Gelfand (2005) 
have applied slice sampling for sampling parameters 
in Gaussian spatial process models, they assumed a 
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linear-Gaussian observation model. For non-Gaussian 
data it was suggested that "there seems to be little role 
for slice sampling." 

Elliptical slice sampling changes all of the variables in 
f at once, although there are potentially better ways of 
achieving this. An extensive search space of possibilities 
includes the suggestions for multivariate updates made 
by Neal (2003). 

One simple possible slice sampling update performs a 
univariate update along a random line traced out by 
varying e in (13). As the M-H method based on the 
line worked less well than that based on an ellipse, one 
might also expect a line-based slice sampler to perform 
less well. Intuitively, in high dimensions much of the 
mass of a Gaussian distribution is in a thin ellipsoidal 
shell. A straight line will more rapidly escape this shell 
than an ellipse passing through two points within it. 

3.2 Control variables 

Titsias et al. (2009) introduced a sampling method 
inspired by sparse Gaussian process approximations. 
M control variables f c are introduced such that the 
joint prior p(f , f c ) is Gaussian, and that f still has 
marginal prior Af(0, E). For Gaussian process models a 
parametric family of joint covariances was defined, and 
the model is optimized so that the control variables 
are informative about the original variables: p(f | f c ) 
is made to be highly peaked. The optimization is a 
pre-processing step that occurs before sampling begins. 

The idea is that the small number of control variables f c 
will be less strongly coupled than the original variables, 
and so can be moved individually more easily than the 
components of f . A proposal involves resampling one 
control variable from the conditional prior and then 
resampling f from p(f | f c ). This move is accepted or 
rejected with the Metropolis-Hastings rule. 

Although the method is inspired by an approximation 
used for large datasets, the accept/reject step uses the 
full model. After 0(N 3 ) pre-processing it costs 0(N 2 ) 
to evaluate a proposed change to the A^-dimensional 
vector f . One 'iteration' in the paper consisted of an 
update for each control variable and so costs 0(MN 2 ) 
- roughly M elliptical slice sampling updates. The 
control method uses fewer likelihood evaluations per 
iteration, although has some different minor costs asso- 
ciated with book-keeping of the control variables. 

3.3 Local updates 

In some applications it may make sense to update only 
a subset of the latent variables at a time. This might 
help for computational reasons given the 0(N 2 ) scaling 
for drawing samples of subsets of size N . Titsias et al. 



(2009) also identified suitable subsets for local updates 
and then investigated sampling proposals from the 
conditional Gaussian prior. 

In fact, local updates can be combined with any tran- 
sition operator for models with Gaussian priors. If 
f_A is a subset of variables to update and are the 
remaining variables, we can write the prior as: 



■AT 



^A,A 

Era 



^A,B 
^B,B 



(14) 



and the conditional prior is: 



p(fA|fs) =AT{{a\ m, S), where 

m = T,a,b^ b ]b^b, and £ = Saa-Sa B E b | b S b ^. 

A change of variables g = f a— m allows us to express the 
conditional posterior as: p*(g) oc Af(g; 0, S) L(^ S +™J). 
We can then apply elliptical slice sampling, or any al- 
ternative, to update g (and thus f^). Updating groups 
of variables according to their conditional distributions 
is a standard way of sampling from a joint distribution. 

4 Experiments 

We performed an empirical comparison on three Gaus- 
sian process based probabilistic modeling tasks. Only 
a brief description of the models and methods can 
be given here. Full code to reproduce the results is 
provided as supplementary material. 

4.1 Models 

Each of the models associates a dimension of the latent 
variable, /„, with an 'input' or 'feature' vector x„. The 
models in our experiments construct the covariance 
from the inputs using the most common method, 

S« = <rj exp(- \ Yld=i( x d,i ~ Xd^/t 2 ), (15) 

the squared-exponential or "Gaussian" covariance. 
This covariance has "lengthscale" parameter I and 
an overall "signal variance" a 2 . Other covariances may 
be more appropriate in many modeling situations, but 
our algorithm would apply unchanged. 

Gaussian regression: given observations y of the 
latent variables with Gaussian noise of variance a 2 , 



Mf) = Il n JV(»n;/n,0. 



(16) 



the posterior is Gaussian and so fully tractable. We 
use this as a simple test that the method is working 
correctly. Differences in performance on this task will 
also give some indication of performance with a simple 
log-concave likelihood function. 
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We generated ten synthetic datasets with input fea- 
ture dimensions from one to ten. Each dataset was 
of size iV = 200, with inputs {x n }^ =1 drawn uniformly 
from a D-dimensional unit hypercube and function 
values drawn from a Gaussian prior, f ~7V(0, £), using 
covariance (15) with lengthscale £=1 and unit signal 
variance, ah = 1. Noise with variance a\ 
added to generate the observations. 
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Gaussian process classification: a well-explored 
application of Gaussian processes with a non-Gaussian 
noise model is binary classification: 



£c(f) = rL*(y»/n). 



(17) 



where y n 6 {— 1, +1} are the label data and a (a) is a 
sigmoidal function: l/(l + e~ a ) for the logistic classifier; 
a cumulative Gaussian for the probit classifier. 

We ran tests on the USPS classification problem 
as set up by Kuss and Rasmussen (2005). We 
used log (Jf = 3.5, log^ = 2.5 and the logistic likelihood. 

Log Gaussian Cox process: an inhomogeneous Pois- 
son process with a non-parametric rate can be con- 
structed by using a shifted draw from a Gaussian pro- 
cess as the log-intensity function. Approximate infer- 
ence can be performed by discretizing the space into 
bins and assuming that the log-intensity is uniform in 
each bin (M0ller et al., 1998). Each bin contributes a 
Poisson likelihood: 



L„(f) = n 



\ n Vn exp(-A„) 



A„ = e / " +m , (18) 



where the model explains the y n counts in bin n as 
drawn from a Poisson distribution with mean A„. The 
offset to the log mean, m, is the mean log-intensity of 
the Poisson process plus the log of the bin size. 

We perform inference for a Cox process model of the 
dates of mining disasters taken from a standard data 
set for testing point processes (Jarrett, 1979). The 
191 events were placed into 811 bins of 50 days each. 
The Gaussian process parameters were fixed to ah = 1 
and ^ = 13516 days (a third of the range of the dataset). 
The offset m in (18) was set to m = log(191/811), to 
match the empirical mean rate. 

4.2 Results 

A trace of the samples' log-likelihoods, Figure 4, shows 
that elliptical slice sampling and control variables sam- 
pling have different behavior. The methods make differ- 
ent types of moves and only control variables sampling 
contains rejections. Using long runs of either method 
to estimate expectations under the target distribution 
is valid. However, sticking in a state due to many 
rejections can give a poor estimator as can always mak- 
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Figure 4: Traces of log-likelihoods for the 1-dimensional GP 
regression experiment. Both lines are made with 333 points 
plotted after each sweep through M = 3 control variables 
and after every 3 iterations of elliptical slice sampling. 
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Figure 5: As in Figure 4 but for 10-dimensional regression 
and plotting every M = 78 iterations. (Control variables 
didn't move on this run.) 



ing small moves. It can be difficult to judge overall 
sampling quality from trace plots alone. 

As a quantitative measure of quality we estimated the 
"effective number of samples" from log-likelihood traces 
using R-CODA (Cowles et al., 2006). Figure 6 shows 
these results along with computer time taken. The step 
size for Neal's Metropolis method was chosen using a 
grid search to maximize performance. Times are for the 
provided implementations under Matlab v7.8 on a sin- 
gle 64 bit, 3 GHz Intel Xeon CPU. Comparing runtimes 
is always problematic, due to implementation-specific 
details. Our numbers of effective samples are primarily 
plotted for the same number of 0(N 2 ) updates with 
the understanding that some correction based loosely 
on runtime should be applied. 

The control variables approach was particularly recom- 
mended for Gaussian processes with low-dimensional 
input spaces. On our particular low-dimensional 
synthetic regression problems using control variables 
clearly outperforms all the other methods. On the 
model of mining disasters, control variable sampling 
has comparable performance to elliptical slice sampling 
with about 50% less run time. On higher-dimensional 
problems more control variables are required; then 
other methods cost less. Control variables failed to sam- 
ple in high-dimensions (Figure 5). On the USPS classifi- 
cation problem control variables ran exceedingly slowly 
and we were unable to obtain any meaningful results. 

Elliptical slice sampling obtained more effective samples 
than Neal's M-H method with the best possible step 
size, although at the cost of increased run time. On the 
problems involving real data, elliptical slice sampling 
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Figure 6: Number of effective samples from 10 5 iterations after f0 4 burn in, with time and likelihood evaluations required. 
The means and standard deviations for 100 runs are shown (divide the "error bars" by 10 to get standard errors on the 
mean, which are small). Each iteration involves one 0(N 2 ) operation (e.g. one u draw or updating one control variable). 
Each group of bars in the top two rows has been rescaled for readability: the numbers beneath each group show the 
number of effective samples or CPU time in seconds for elliptical slice sampling, which always has bars of height 1. 



was better overall whereas M-H has more effective 
samples per unit time (in our implementation) on the 
synthetic problems. The performance differences aren't 
huge; either method would work well enough. 

Elliptical slice sampling takes less time than slice sam- 
pling along a straight line (line sampling involves addi- 
tional prior evaluations) and usually performs better. 

5 Discussion 

The slice samplers use many more likelihood evaluations 
than the other methods. This is partly by choice: 
our code can take a step-size parameter to reduce the 
number of likelihood evaluations (Section 2.4). On 
these problems the time for likelihood computations 
isn't completely negligible: speedups of around x 2 may 
be possible by tuning elliptical slice sampling. Our 
default position is that ease-of-use and human time is 
important and that the advantage of having no free 
parameters should often be taken in exchange for a 
factor of two in runtime. 

We fixed the parameters of £ and L in our experi- 
ments to simplify the comparison. Fixing the model 
potentially favors the methods that have adjustable 
parameters. In problems were £ and L change dra- 
matically, a single step-size or optimized set of control 
variables could work very poorly. 

Elliptical slice sampling is a simple generic algorithm 
with no tweak parameters. It performs similarly to the 
best possible performance of a related M-H scheme, 
and could be applied to a wide variety of applications 
in both low and high dimensions. 
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